Get Data

a = read_csv('raw_payroll_data.csv')
a %>% head
a %>% glimpse
## Rows: 104
## Columns: 5
## $ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ dept_id   <dbl> 6, 4, 3, 4, 2, 6, 3, NA, 3, 6, 6, 6, 4, 4, 6, 3, 5, 4, 4,...
## $ dept_name <chr> "Production", "Business Development", "Design", "Business...
## $ name      <chr> "B****, Li*** N.", "C******, Jane G.", "E******, Ma* B.",...
## $ salary    <chr> "91100", "173900", "163100", "71000", "111100", "80600", ...
#simple cleaning
#a = a %>% select(-id, -name)

a = a %>% mutate(
  salary = as.numeric(salary),
  dept_id = as.factor(dept_id)
)

#dept_id should be a factor/category variable
#especially import to convert if imputation is done

a %>% head

Exploratory Data Analysis

Check zeros, na, inf

a %>% visdat::vis_dat()

a %>% funModeling::status()
a %>% DataExplorer::plot_missing()

Check counts, levels

a %>% count(dept_name, dept_id, name = 'count') %>% mutate(percent = count/sum(count)) %>% arrange(dept_name, -percent)
a %>% select(dept_name) %>% freq
a %>% count(name) %>% arrange(-n)

Some dept names have been misspelled. Some names also have duplicates .

Correct dept name misspellings & Remove duplicate names

#correct dept misspellings
a = a %>% mutate(
  dept_name = if_else(dept_name == 'Business Developement', 'Business Development', dept_name),
  dept_name = if_else(dept_name == 'Producdion', 'Production', dept_name),
) %>% mutate(
  dept_name = factor(dept_name)
)

#check
a %>% select(dept_name) %>% freq
#remove rows with duplicate names
a = a %>% distinct(name, .keep_all = TRUE)

#check
#a %>% n_distinct()/a %>% nrow()

Data Imputation

provided mapping reference

(dept.mapping = tibble(
  dept_id = a %>% pull(dept_id) %>% unique() %>% sort,
  dept_name = c('Human Resources', 'Design', 'Business Development', 'Accounting', 'Production'),
))

Note!! I could use the provided dept mapping (replicated above) to [perfectly] impute the missing data for both dept_id & dept_name via some clever find and replacing.

But . . . there’s a faster and more automated way! Imputation via Machine Learning, specifically the Random Forest Algorithm (both Classification and Regression).

missing data reference

# these rows are missing for dept_name
a %>% filter(is.na(dept_name))
missing.dept_name.ids = a %>% filter(is.na(dept_name)) %>% pull(id)

# these rows are missing for dept_id
a %>% filter(is.na(dept_id))
missing.dept_id.ids = a %>% filter(is.na(dept_id)) %>% pull(id)

# these rows are missing for salary
a %>% filter(is.na(salary))
missing.salary.ids = a %>% filter(is.na(salary)) %>% pull(id)

actual imputation of dept_id, dept_name, salary

library(tidymodels)

rec = a %>% recipe(salary ~ .) %>% 
  step_bagimpute(dept_id, dept_name, impute_with = imp_vars(dept_id, dept_name)) %>% 
  step_bagimpute(salary, impute_with = imp_vars(dept_id, dept_name))

a.imputed = rec %>% prep %>% juice
#a.imputed %>% DataExplorer::plot_missing()

proof of perfect Random Forest Imputation

original reference

dept.mapping

observations with imputated dept name

a.imputed %>% filter( id %in% missing.dept_name.ids )

observations with imputated dept id

a.imputed %>% filter( id %in% missing.dept_id.ids )

observations with imputated salary

a.imputed %>% filter( id %in% missing.salary.ids )

avg and median salary by department

a.imputed %>% group_by(dept_name) %>% summarise(
  mean.salary = mean(salary),
  median.salary = median(salary)
)
## `summarise()` ungrouping output (override with `.groups` argument)

Observation: For individuals who had missing ‘salary’ info, the random forest algorithm replaced the missing salary salary with the median salary of the department they belonged to.

Determining Statistical Signifiance of Differences

visualize salary distributions by department

a.imputed %>% 
  plot_ly(x = ~dept_name, y = ~salary, color = ~dept_name) %>%
  add_boxplot(quartilemethod = 'inclusive') %>% 
  hide_legend() %>% 
  layout(
    title = 'Distribution of Salary by Department',
    xaxis = list(title = ''),
    yaxis = list(title = '')
    )

chart salary distributions by department

a.imputed %>% group_by(dept_name) %>% summarise(
  q1 = quantile(salary, 0.25),
  q2 = quantile(salary, 0.50),
  q3 = quantile(salary, 0.75),
)
## `summarise()` ungrouping output (override with `.groups` argument)

Is there an sig diff between the median salaries of the depts?

anova.salary.dept = aov(salary ~ dept_name, a.imputed)
anova.salary.dept %>% tidy

Conclusion: Yes, the p values is less than 5%

Even Further Investigation

For what specific pairwise-comparisons is there a sig diff?

list all pairwise-comparisons

TukeyHSD(anova.salary.dept)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = salary ~ dept_name, data = a.imputed)
## 
## $dept_name
##                                            diff        lwr        upr     p adj
## Business Development-Accounting        9080.400 -27368.406 45529.2059 0.9575614
## Design-Accounting                     46343.787  11206.158 81481.4153 0.0036298
## Human Resources-Accounting             6247.651 -37663.538 50158.8409 0.9947385
## Production-Accounting                 13175.588 -19715.006 46066.1816 0.7986801
## Design-Business Development           37263.387   5869.816 68656.9577 0.0116113
## Human Resources-Business Development  -2832.749 -43809.521 38144.0243 0.9996893
## Production-Business Development        4095.188 -24761.259 32951.6345 0.9947901
## Human Resources-Design               -40096.135 -79911.125  -281.1455 0.0475253
## Production-Design                    -33168.199 -60349.663 -5986.7345 0.0087443
## Production-Human Resources             6927.937 -30918.749 44774.6222 0.9862736

list only statistically significant pairwise-comparisons

TukeyHSD(anova.salary.dept) %>% tidy %>% filter(adj.p.value < 0.05) %>% select(contrast, adj.p.value)

Clearly ,there is financial mismanagement/ fraudulent activity occurring in the Design Department.

Looking at the salary distribution by dept viz above, the results above make sense

The ‘Design’ dept clearly has a median pay above that of the other departments. Even more telling, its first quartile, at $103,250 is nearly higher than the median of 3 other departments.